Research Synthesis Methods

Wiley

Preprints posted in the last 30 days, ranked by how well they match Research Synthesis Methods's content profile, based on 20 papers previously published here. The average preprint has a 0.02% match score for this journal, so anything above that is already an above-average fit.

1
Ranked (In)direct Citation Searching in Systematic Reviews: A methodological case study

Woelfle, T.; Fucile, G.; Hirt, J.; Pena, R. C. G.; Vogt, M.; Nordhausen, T.; Ewald, H.; Appenzeller-Herzog, C.

2026-05-27 medical education 10.64898/2026.05.26.26354093 medRxiv
Top 0.1%
61.4%

Systematic Review (SR) is a prosperous study type in modern medicine and beyond. Many SR authors complement their primary database searches with supplementary techniques. Among these, citation-based techniques known as citation searching (CS) are widespread. Unranked Direct CS (UDCS), which identifies directly cited and citing literature of seed references, is currently most prevalent. Ranked (In)direct CS (RICS) additionally collects co-cited and co-citing literature and combines this with a ranking and cut-off procedure. However, RICS workflows remain non-standardized and tedious, and their benefits are unclear. This work aims to create a framework for the prospective international comparison of supplementary UDCS and RICS. To prime RICS research, we developed the open-source Co*Citation Network application and assessed parallel supplementary UDCS and RICS retrospectively in three completed SRs and prospectively in one case study. Automated RICS collected and ranked cited, citing, co-cited, and co-citing literature of seed references from the OpenAlex database and applied an empirical rank cut-off to approximate the volume of UDCS results. With RICS, we consistently noted higher overlap with primary database search results than with UDCS. Title/abstract screening in the case study showed a precision (number needed to read) of 1.8% (57) for UDCS and 2.1% (48) for RICS results. After full text screening, two additional articles were included for review, one of which was identified by UDCS and RICS, and one exclusively by UDCS. The present study indicates potential benefits of RICS for SR authors and will enable the formation of a research consortium to compare supplementary UDCS and RICS on a larger scale.
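The pairing of precision with number needed to read (NNR) in this abstract follows the usual reciprocal relationship, NNR = 1 / precision. The short Python sketch below only illustrates that arithmetic; the function names are hypothetical and the abstract does not report the underlying screening counts.

```python
def precision_to_nnr(precision: float) -> float:
    """Number needed to read: records screened per relevant record found."""
    if not 0 < precision <= 1:
        raise ValueError("precision must be in (0, 1]")
    return 1.0 / precision


def nnr_to_precision_pct(nnr: float) -> float:
    """Precision in percent implied by a given number needed to read."""
    if nnr < 1:
        raise ValueError("NNR must be at least 1")
    return 100.0 / nnr


# NNR values quoted in the abstract; the percentages are recomputed from them.
for label, nnr in (("UDCS", 57), ("RICS", 48)):
    print(f"{label}: NNR {nnr} implies precision {nnr_to_precision_pct(nnr):.1f}%")
```

Rounded to one decimal, the recomputed percentages agree with the 1.8% and 2.1% reported above.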

2
Automating Screening of Titles and Abstracts in Systematic Reviews: An Assessment of GPT-4o mini

Fazeli, M. S.; Kasireddy, E.; Pourrahmat, M.-M.; Chow, C.; Collet, J. P.

2026-05-20 health informatics 10.64898/2026.05.15.26353334 medRxiv
Top 0.1%
23.2%

Background: Systematic literature reviews (SLRs) are essential in medical research, but are often time-consuming and costly, necessitating more efficient methods while maintaining accuracy. Objective: This study assessed the performance of a GPT-4o mini large language model (LLM) in automating the first phase of study selection based on titles and abstracts in systematic reviews. Specifically, we evaluated whether the model improved efficiency without compromising on quality. Methods: Structured prompts were created for a GPT-4o mini LLM to facilitate title and abstract screening. The model's performance was evaluated against expert human reviewers across five systematic reviews on inclusion rates, sensitivity, specificity, accuracy, positive predictive value, and negative predictive value. Results: The model screened a total of 15,605 records. It included a higher percentage of studies than human screeners, with 3.5% (n=549/15,605) true positives and 14.2% (n=2,218/15,605) false positives. The model achieved an overall accuracy of 85.1%, with a sensitivity of 83.2% and specificity of 85.2%. The positive predictive value was 19.8%, while the negative predictive value was 99.1%. The model was able to screen 1,000 titles and abstracts in 40 minutes, compared to 16 hours required by a human reviewer. Conclusion: This study demonstrated a strong performance and efficiency in the automation of title and abstract screening in SLRs using an advanced LLM. Further refinements could optimize the balance between sensitivity and specificity, supporting broader implementation in evidence synthesis. A hybrid AI-human approach is recommended to ensure accuracy, reduce reviewer burden, and maintain the methodological rigor required for high-quality SLRs.
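The reported screening metrics can be checked against a single confusion matrix. The sketch below is illustrative only: the true-positive and false-positive counts (549 and 2,218) come from the abstract, while the false-negative and true-negative counts (111 and 12,727) are back-calculated assumptions chosen to be consistent with the reported total of 15,605 records and the reported sensitivity; they are not stated by the authors.

```python
def screening_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard diagnostic accuracy measures from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),   # share of relevant records the model kept
        "specificity": tn / (tn + fp),   # share of irrelevant records the model excluded
        "accuracy": (tp + tn) / total,
        "ppv": tp / (tp + fp),           # precision of the model's inclusions
        "npv": tn / (tn + fn),           # reliability of the model's exclusions
    }


# tp and fp are reported in the abstract; fn and tn are ASSUMED (back-calculated).
metrics = screening_metrics(tp=549, fp=2218, fn=111, tn=12727)
for name, value in metrics.items():
    print(f"{name}: {value * 100:.1f}%")
```

Under these assumed counts the five measures round to the percentages given in the abstract, which also shows why a low positive predictive value can coexist with a high negative predictive value when relevant records are rare.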

3
Study Design Indexing in Transition: A Focused Comparison of manual NLM Indexing vs. Transformer-based Automated Models

Das, P.; Schneider, J.; Mayo-Wilson, E.; Kilicoglu, H.; Menke, J. D.; Nam, D.; Ninan, K.; Oberste, J.-P.; Troy, A. M.; Ying, X.; Holt, A. W.; Smalheiser, N. R.

2026-06-04 health informatics 10.64898/2026.06.03.26354854 medRxiv
Top 0.1%
14.5%

Objectives: Study design indexing of biomedical publications is crucial for evidence retrieval and synthesis. We sought to evaluate the accuracy and suitability of a transformer-based model (TM) for indexing clinical study designs, in comparison to National Library of Medicine (NLM) indexing. However, this is challenging for at least three reasons: First, to date, all automated systems have been trained and evaluated on manual NLM indexing assignments, which are themselves subject to errors. Second, TM's probabilistic predictive scores take into account uncertainty, and can be converted to TRUE/FALSE assignments in different ways depending on the needs of users, while NLM labels are categorical. Third, our goal (to tag only articles that exhibit a given design) differs from that of NLM, which tags articles that discuss as well as exhibit that design. Materials and Methods: Therefore, we carried out a limited evaluation of the TM model that focuses only on the articles that received the most confident predictions, that is, the highest scores that are almost certainly TRUE and the lowest scores that are almost certainly FALSE, but which disagreed with NLM assignments. This was performed both for articles published in 2016 (when NLM decisions were manual) and in 2025 (when NLM decisions were automated). To establish ground truth, dual annotators indexed the articles independently, following written definitions, for four prominent study designs--cohort, case-control, cross-sectional, and case report. Results: For three designs (case-control, case report, cross-sectional), the articles having the top 100 predictive TM scores (when NLM failed to assign that design) were judged to exhibit that design in the great majority (86-100%) of cases. Conversely, the articles having the lowest 100 predictive TM scores (when NLM did assign the study design) exhibited the design only in relatively few (0-21%) of cases.
The most confident predictions of the TM model were highly accurate and not redundant with automated NLM indexing; the exception was cohort studies articles, in which both TM and NLM labels showed high error rates of both omission and commission. Discussion and Conclusion: TM may have value for identifying articles exhibiting study designs, which is especially important for clinical decision-making as well as systematic reviews and other evidence syntheses. NLM indexing of cohort studies cannot be regarded as a reliable gold standard for training or evaluation of automated systems, warranting efforts to create a new manually annotated corpus.

4
Accounting for Uncertainty in the Null Benchmark in Two-Stage Phase II Trials

Irlmeier, R.; Jin, Z.; Ye, F.

2026-05-18 epidemiology 10.64898/2026.05.14.26353210 medRxiv
Top 0.1%
8.4%

Background: Simon two-stage designs for binary endpoints and their time-to-event analogues, including the Kwak and Jung method, rely on a fixed null benchmark. Their Type I error control is valid only when that benchmark is correctly specified. In practice, historical benchmarks are often inconsistent due to small samples, population heterogeneity, changing eligibility criteria, and evolving standards of care. Even modest misspecifications can substantially inflate the Type I error rate, leading to costly advancement of ineffective treatments. Methods: We propose the Interval-Null Robust (INR) two-stage design framework that accounts for uncertainty in the historical null benchmark. We define the null hypothesis as a plausible range of clinically uninteresting values: $p \in [p_{0L}, p_{0U}]$ for binary endpoints and $\lambda \in [\lambda_{0L}, \lambda_{0U}]$ (or equivalent survival probabilities) for time-to-event endpoints. Type I error is controlled uniformly over the full null interval: $\sup_{\theta \in \Theta_0} \Pr_\theta(\mathrm{Go}) \le \alpha$. Under the monotonicity of the Go probability, the supremum occurs at the least favorable null configuration - $p_{0U}$ and $\lambda_{0L}$ - but the design is not reduced to a point-null formulation. The interval defines the uncertainty set for error control and is used in selecting among feasible designs through robust criteria such as worst-case regret or minimal average expected sample size. Results: Across representative planning scenarios for both endpoint types, classic designs calibrated to a single benchmark exhibit substantial Type I error inflation when the true null parameter exceeds the assumed planning value. INR designs maintain the nominal Type I error rate across the full null interval, directly addressing this vulnerability to benchmark misspecification. The robustness-efficiency trade-off can be managed through design constraints and robust optimization criteria while preserving uniform Type I error control.
Conclusions: INR two-stage designs offer a transparent framework for addressing historical control uncertainty in single-arm Phase II trials. By replacing reliance on a fixed benchmark assumption with a more realistic interval of clinically plausible null values, INR design reduces the risk of false-positive Go-decisions caused by benchmark misspecification. INR applies to both binary and time-to-event endpoints and is implemented in the open-source INRDesign R package and accompanying interactive Shiny app.
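To make the idea of worst-case Type I error over a null interval concrete, the following Python sketch computes the Go probability of a generic two-stage binary design and takes its maximum over a grid of null response rates. It is a minimal illustration, not the INRDesign package or the authors' algorithm: the function names, the design parameters (10 patients and a futility bound of 1 in stage 1, 29 patients and a bound of 5 overall), and the interval [0.10, 0.15] are all assumptions chosen for demonstration.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Binomial probability of exactly k responses among n patients."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def go_probability(p: float, n1: int, r1: int, n: int, r: int) -> float:
    """Pr(Go) for a two-stage design: continue if stage-1 responses exceed r1,
    then declare Go if total responses across both stages exceed r."""
    n2 = n - n1
    total = 0.0
    for x1 in range(r1 + 1, n1 + 1):
        needed = max(r - x1 + 1, 0)  # stage-2 responses required for Go
        tail = sum(binom_pmf(x2, n2, p) for x2 in range(needed, n2 + 1))
        total += binom_pmf(x1, n1, p) * tail
    return total


def worst_case_type1(p_lo: float, p_hi: float, grid: int = 101, **design) -> float:
    """Largest Go probability over a grid spanning the null interval [p_lo, p_hi]."""
    points = [p_lo + (p_hi - p_lo) * i / (grid - 1) for i in range(grid)]
    return max(go_probability(p, **design) for p in points)


# Hypothetical design and null interval, for illustration only.
design = dict(n1=10, r1=1, n=29, r=5)
print("Pr(Go) at p = 0.10:", go_probability(0.10, **design))
print("Pr(Go) at p = 0.15:", go_probability(0.15, **design))
print("worst case over [0.10, 0.15]:", worst_case_type1(0.10, 0.15, **design))
```

Because the Go probability increases with the response rate, the grid maximum coincides with the value at the upper endpoint, which is the least favorable configuration described in the abstract for binary endpoints.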

5
Benchmarking foundation models for improving confounding control in target trial emulation

Kleper, S. L.; Melamed, R. D.

2026-05-13 epidemiology 10.64898/2026.05.09.26352820 medRxiv
Top 0.1%
7.3%

Machine learning models for causal inference aim to adjust for confounding factors that are associated with both an exposure and an outcome, creating a spurious, biased association. However, these methods are rarely evaluated empirically to assess their success in mitigating such bias. Recent advances in knowledge representation, including both foundation models and knowledge graphs, could enrich these models, but rigorous evaluations are needed in order to assess their potential. Here, we ask whether enriching existing causal inference models with knowledge representations from foundation models can improve confounding control. Rather than using semi-simulated data to address this question, we focus on examples of real confounding: we emulate target randomized active comparator trials that are subject to confounding by indication. Our results can guide researchers aiming to develop or apply methods for discovering causal effects from observational data.

6
Audited large language model triage for systematic review screening in national clinical guideline production: validation and prospective deployment

Fagerberg, P.; Sallander, O.; Vikhe Patil, K.; Thunborg, C.; Lundstrom, L.; Berg, A.; Nyman, A.; Borg, N.; Linden, T.

2026-06-03 health informatics 10.64898/2026.06.02.26354724 medRxiv
Top 0.1%
6.3%

Title and abstract screening limits the timeliness of systematic reviews used for clinical guidelines. We evaluated audited large language model (LLM) triage at Sweden's National Board of Health and Welfare. Ten LLMs from five model families were tested on 419 Cochrane reviews comprising 26,892 records, and the selected ensemble was externally validated on 133 reviews including 8,501 records matched to planned guideline topics. The same locked model pair was then used prospectively across 24 systematic reviews in two national guideline programmes. On the 419-review selection benchmark, the selected Gemini-3-flash plus GPT-5.1 ensemble achieved 98.0% (95% CI, 97.3-98.7) mean review-level sensitivity, while topic-matched validation yielded 96.7% sensitivity (95% CI, 93.7-98.9). Prospective deployment screened 74,679 records, placed 63,858 (85.5%) in the AI-excluded pool and reduced estimated first-pass screening effort from 415 to 34 person-days. Across 600 randomly sampled AI-excluded records from the migraine and dementia programmes, none was confirmed as a final false negative after post-unblinding adjudication; across the completed 680-record audit, all 38 final retained records had been AI flagged, whereas locked blinded human consensus missed seven. These findings support locked, audited LLM triage, with human oversight and programme-specific monitoring, for systematic reviews used in national guidelines.

7
The Hypothesis Race Model for evaluation of research findings

Kelly, R. E.

2026-05-29 scientific communication and education 10.64898/2026.05.28.728385 medRxiv
Top 0.1%
4.0%

Null Hypothesis Significance Testing (NHST) remains the dominant paradigm for evaluation of empirical research findings in medicine and the social sciences despite concerns about frequent misinterpretations of those findings. Achievement of "statistical significance," the goal of NHST, often beckons unrealistic conclusions. Helpful would be the addition of a broader, Bayesian perspective of research in terms of progressive readjustment of hypothesis credibility from all sources of evidence. For this purpose, the Hypothesis Race Model (HRM) provides an intuitive Bayesian approach that builds upon NHST-concepts, helping to correct misunderstandings with minimal reeducation. The HRM is an extension of the Bayesian approach by Ioannidis in 2005 that helped to explain "why most published research findings are false." It is powerful enough to serve as the foundation for mathematical models to estimate and reduce the cost of empirical hypothesis testing.

8
Promise vs. Proof in Digital Interventions for Antimicrobial Stewardship: A Systematic Review and Meta-Analysis of Randomized Controlled Trials

Matos Porto, A. P.; Gomes, M. S.; de Oliveira, V. F.; Mwanja, H.; Zhu, N.; Holmes, A.; Levin, A. S.; Costa, S. F.

2026-06-03 infectious diseases 10.64898/2026.06.01.26354656 medRxiv
Top 0.1%
1.9%

Background: Digital antimicrobial stewardship (AMS) interventions, such as clinical decision support systems, audit and feedback platforms, and electronic prescribing tools, have been increasingly adopted to improve antibiotic use. However, the effectiveness of these interventions across healthcare settings remains uncertain, and the certainty of the evidence has not been comprehensively evaluated. The objective of this study was to provide a comprehensive understanding of the role of digital interventions in optimizing antimicrobial use and improving clinical outcomes within a broad spectrum of healthcare settings. Methods: We conducted a systematic review and meta-analysis of randomized controlled trials evaluating digital AMS interventions. The review followed PRISMA 2020 guidelines, was registered in PROSPERO (CRD420251178854), and was funded by the Wellcome Trust CAMO Net programme. Searches were performed across major databases. Primary outcomes included the appropriateness of antibiotic prescriptions and the antibiotic prescription rate. Secondary outcomes included 30-day mortality, 30-day hospital readmission, and length of hospital stay (LOS). Random effects models were used to pool effect sizes. Risk of bias was assessed using RoB 2, and certainty of evidence was rated using GRADE. A Summary of Findings table was prepared to present effect estimates, sample sizes, and evidence certainty. Results: Eleven RCTs met the inclusion criteria, and nine were included in the quantitative synthesis. Digital AMS interventions did not show a significant effect on appropriateness of antibiotic prescribing (RR 0.99, 95%CI 0.93 to 1.05; very low certainty). There was no reduction in antibiotic prescription (RR 0.98, 95%CI 0.88 to 1.09), with substantial statistical heterogeneity and very low certainty.
Across clinical outcomes, digital AMS showed no effect on 30-day mortality (RR 0.91, 95%CI 0.77 to 1.09; very low certainty) or 30-day readmission (RR 0.95, 95%CI 0.79 to 1.14; very low certainty). For LOS, results were inconsistent across studies, and the pooled effect showed no clinically meaningful change (MD 0.17 days, 95%CI 0.01 to 0.35; very low certainty). Most trials had some concerns of bias due to deviations from intended interventions. Conclusion: Meta-analyses of digital AMS RCTs showed a lack of evidence with a high level of certainty on antibiotic prescribing or clinical outcomes due to high heterogeneity in interventions and study designs, as well as RCTs' limitations (no adoption/fidelity metrics).

9
The control gap in long COVID research: a meta-epidemiological analysis

Panagiotopoulos, A.-P.; Laskaris, A.; Tsakri, D.; Manoussopoulos, Y.; Anastassopoulou, C.; Tsakris, A.; Ioannidis, J.

2026-05-21 epidemiology 10.64898/2026.05.16.26353381 medRxiv
Top 0.1%
1.9%

Objectives: To quantify the frequency of baseline control-group use in published long COVID prevalence studies and assess their key methodological features. Design: Cross-sectional meta-epidemiological evaluation of published post-acute COVID-19 prevalence studies, supplemented by a corresponding-author survey. Setting: Published studies identified through a systematic review by Hou et al. (2025) and supplementary data obtained through direct email contact with corresponding authors. Participants: A total of 440 published long COVID prevalence studies. Main outcome measures: Presence and type of comparator group, reliance on solely self-reported outcomes, acknowledgment of lack of a control group among uncontrolled studies, and availability of additional comparator data through author survey. Results: Among 440 studies, 372 (84.5%) reported no control group in their publication. Healthy or uninfected comparators were reported in 55 studies (12.5%) and other comparator types in 14 (3.2%); 1 study included both categories. Solely self-reported outcomes were used in 279 studies (63.4%). Among 372 uncontrolled studies, 244 (65.6%) did not explicitly acknowledge the absence of a baseline comparator as a limitation anywhere in the text. Corresponding authors of 140 studies (31.8%) responded to the survey; among them, 126 (90.0%) reported no additional comparative data, while 14 (10.0%) mentioned some available comparative datasets (19 additional datasets). Almost all of that information (10/14, 17/19) had already been published in other articles not captured by the Hou et al. systematic review. Conclusions: Most published long COVID prevalence studies lacked comparator groups and relied exclusively on self-reported outcomes without acknowledging this limitation. Direct author contact identified little additional comparator information.
Much of the long COVID prevalence literature may therefore be poorly suited to estimating burden attributable specifically to SARS-CoV-2, underscoring the need for appropriately matched comparators and more objective outcome assessment. Registration: The protocol was prospectively registered on the Open Science Framework (https://osf.io/f4hra).

10
ChooseMyStat: A Web-Based Interactive Tool for Statistical Test Selection and Analysis Plan Generation in Clinical Research

Srivastava, S.; Punyani, S. R.; Vazalwar, D.; Joshi, A.; Pakhare, A. P.

2026-06-03 medical education 10.64898/2026.06.02.26354730 medRxiv
Top 0.1%
1.7%

Background: Postgraduate medical residents frequently face difficulty in selecting appropriate statistical tests and preparing statistical analysis plans (SAPs) for thesis work. Existing resources often identify statistical tests without guiding implementation, reporting or software execution. Aims: To describe the development, features and content validation of ChooseMyStat, a free, open-source, web-based interactive tool for statistical test selection and SAP text generation in clinical research. Methods: ChooseMyStat was developed as a React-based web application using an iterative, AI-assisted development process under direct faculty supervision. The tool uses a branching decision algorithm covering 18 inferential statistical tests, two diagnostic accuracy measures, four agreement/reliability statistics, and four descriptive statistics scenarios. For each recommendation, it generates a SAP template paragraph, a results reporting example, step-by-step JASP instructions, and R code. Content validation was performed using 105 open-access original research articles from 15 broad medical specialties published in Indian journals during 2024-2025. Results: The tool covers commonly used statistical methods, including t-tests, ANOVA, chi-square variants, non-parametric alternatives, correlation, regression (linear, logistic, ordinal), survival analysis, methods for clustered or repeated data, diagnostic accuracy measures, and agreement/reliability statistics. Among 365 statistical tests identified across 105 articles (excluding normality checking procedures), 346 (94.8%) were covered by the tool. Complete coverage of all statistical methods used was observed in 86 of 105 articles (81.9%). Conclusions: ChooseMyStat integrates statistical test selection with implementation guidance, SAP generation, reporting support and software instructions within a single interface.
The tool may support postgraduate research training by improving accessibility to applied biostatistics guidance.

11
Large-Scale Assessment of Animal-to-Human Drug Translation Using Natural Language Processing

Doneva, S. E.; Ellendorff, T. R.; Schneider, G.; Held, L.; von Wyl, V.; Simpson, I.; Sick, B.; Ineichen, B. V.

2026-05-22 bioinformatics 10.64898/2026.05.20.726540 medRxiv
Top 0.1%
1.4%

Background: Large-scale estimates of animal-to-human drug translation and the study characteristics associated with successful translation remain limited. The expanding preclinical literature also challenges manual evidence synthesis. We developed a natural language processing (NLP) pipeline to structure and link preclinical and clinical evidence at scale. Methods: In this retrospective meta-research study, we analysed more than 500,000 neuroscience-related animal drug studies from PubMed and linked them to clinical trial and regulatory approval data. NLP methods extracted drug, disease, and experimental design characteristics from abstracts and full texts. Translation was defined as progression to completed phase III/IV trials or regulatory approval. Logistic regression assessed associations between preclinical study characteristics and successful translation. Findings: Among 291,624 drug entities identified in animal studies, 6·7% entered clinical development and 3·1% reached phase III/IV trials or regulatory approval. At the drug-disease level, 4·4% entered clinical development and 1·9% achieved translation. Restricting analyses to successfully linked ontology entities increased estimates to 11·3% and 4·1%, respectively. Male-only animal studies predominated, whereas reporting of randomisation, blinding, and sample size calculations remained limited. Testing across multiple species and reporting blinding were associated with higher odds of successful translation. Interpretation: Only a minority of interventions tested in animals progress to advanced clinical development or regulatory approval. Greater species diversity and blinding were associated with improved translational success. NLP-based evidence synthesis may support scalable evaluation of translational research and identification of potentially modifiable research practices.
Funding: Swiss National Science Foundation, UZH Digital Entrepreneurship Fellowship, Universities Federation for Animal Welfare. Research in context. Evidence before this study: We searched the literature for studies quantifying large-scale animal-to-human translation and factors associated with successful translation. Existing work was mainly limited to specific diseases, interventions, or manually curated datasets, and large-scale linkage of animal and clinical evidence remained limited. Added value of this study: We developed a natural language processing pipeline linking more than 500,000 animal studies to clinical trial and regulatory approval data. The study provides large-scale estimates of translation and identifies experimental characteristics associated with successful translation. Implications of all the available evidence: The findings suggest that only a minority of interventions tested in animals progress to advanced clinical development or regulatory approval. Greater species diversity and reporting of blinding were associated with improved translation. Automated evidence synthesis may support more systematic evaluation of translational research practices.

12
Keeping human in the loop: A three-phase generative AI workflow for research integrity in data-intensive science. A methodological case study using elite Ethiopian distance-running data

Galko, P.; Yisamaw, A.; Haugen, T.; Seiler, S.

2026-05-29 sports medicine 10.64898/2026.05.29.26354013 medRxiv
Top 0.2%
0.9%

Background: Generative AI tools can support data-intensive research by writing code, drafting prose, searching analytical possibilities, and stress-testing claims. They can also produce false citations, drift between statistical specifications, and lose continuity across long investigations. This paper describes a practical workflow for using AI systems in empirical research while keeping discovery, verification, and accountability inspectable. Methods: We developed and applied a three-phase human-AI workflow to a case study of 14 elite Ethiopian distance runners. The dataset contained 22,605 GPS-segments collected across 97 consecutive days in late 2025, supplemented by venue and athlete metadata collected in the field. Phase 1 used an autonomous data-exploration tool to pre-filter the hypothesis space across five seeded research questions. Phase 2 used an AI system under direct human guidance to construct candidate findings into numerical claims, verification scripts, and draft text. Phase 3 used an independent AI system in an adversarial role to stress-test methods, statistics, prose, figures, and citations. The workflow was informed by Pearl's distinction between association, intervention, and counterfactual reasoning, with human judgement retained for research direction, interpretation, and final claims. Results: The workflow produced three empirical analyses and a documented correction process. The analyses estimated an altitude-to-sea-level pace correction of +0.10 min/km per 1,000 m at matched heart rate, showed why pooled altitude-surface regression was not identifiable within this venue system, documented method-dependence in heart-rate-based intensity classification, characterised within-venue route variation as a 64/36 path-fixed-to-trail-variable split with the Sululta label resolving into two functionally distinct sub-venues, and reframed the cohort's training through a 3x3x3 prescription lattice grounded in Ethiopian coaching practice. 
The adversarial phase identified several hallucinated citations, a terminology error between HC1 and cluster-robust standard errors, and several inconsistencies between prose, figures, and computed results. Verification scripts re-derived nearly all numerical claims from the cleaned lap-level data. Conclusions: The case study shows how researchers can organise AI-assisted empirical work so that candidate discovery, claim construction, independent stress-testing, and final accountability remain separated. The workflow did not remove the need for domain expertise or human judgement. Its value was in making the route from candidate finding to manuscript claim explicit, reproducible, and open to challenge. Trial registration: Not applicable.

13
Adherence to data-sharing policies - a comparison of the BMJ with other major medical journals

Avenell, A.; Bishop, D.

2026-05-21 medical ethics 10.64898/2026.05.15.26353284 medRxiv
Top 0.2%
0.9%

Background: In 2024, the BMJ updated its data-sharing policy for clinical trials, requiring deidentified individual patient data (IPD) to be openly deposited prior to publication. Our objective was to discover if data-sharing increased after introduction of the new policy. Method: All data-sharing statements were downloaded from BMJ trials published in 2023 (submitted pre-updated policy) and 2025 (submitted post-updated policy). Data for 2025 were gathered for trials in five comparison medical journals. Data-sharing statements were coded to specify whether IPD were immediately available, and if not, the reason why. Where a statement gave a link to a repository, we checked whether data were available. Results: Openly available IPD for BMJ trials increased from 0/32 prior to the new policy to 19/33 (58%) after the updated policy; seven articles gave repository links that did not yield any data. In the five comparison journals, rates of open IPD varied from 0% to 5.6%. Conclusions: There was a substantial increase in open sharing of IPD after introduction of the new policy compared to a prior period. Open sharing of IPD is possible, but it is unpopular with authors and is unlikely to be achieved without firm editorial enforcement.

14
A Multi-Agent RAG Framework for Biomedical Literature Analysis

Palem, R. R.; Chen, H.; Yue, Z.

2026-05-29 bioinformatics 10.64898/2026.05.26.727050 medRxiv
Top 0.2%
0.8%

Background: The biomedical literature is expanding at an unprecedented rate, with over 4,000 new articles indexed on PubMed each day. Clinicians and researchers frequently lack the time to review this volume before making decisions. Retrieval-Augmented Generation (RAG) systems attempt to bridge this gap by grounding language model responses in relevant documents, but standard implementations rank all retrieved passages solely by semantic similarity, treating a case report and a meta-analysis as equally authoritative. Objective: We aimed to develop and pilot-evaluate a RAG variant that incorporates evidence quality and publication recency into the retrieval scoring function, and to determine whether these signals improve answer quality on biomedical questions compared with standard cosine similarity RAG and a full-context baseline. Methods: We developed ET-RAG (Evidence-Temporal RAG), which scores each retrieved chunk using a weighted combination of cosine similarity (50%), evidence quality based on the GRADE hierarchy (30%), and temporal recency (20%). We evaluated ET-RAG alongside two baselines: a full context agent powered by Gemini 2.0 Flash and a standard cosine RAG agent using GPT-4o-mini. All agents were tested on 40 benchmark questions (10 single-choice, 10 multiple-choice, 10 short answer, and 10 long answer) drawn from 10 peer-reviewed Alzheimer's disease papers published between 2021 and 2025. Results: ET-RAG achieved the highest scores across all four question categories: single choice (0.90), multiple choice (0.74), short answer (0.92), and long answer (0.89), with a combined average of 0.86. Cosine RAG scored 0.80, 0.48, 0.82, and 0.69, respectively (average 0.70), while the full context agent scored 0.60, 0.59, 0.71, and 0.53 (average 0.61). The full context agent, despite having access to the entire corpus through Gemini's large context window, struggled with consistent answer extraction and was prone to rate limiting under heavy query loads.
A control question on forestry was correctly rejected by all three agents, suggesting no hallucination on this control item. Conclusions: In this pilot Alzheimer's disease benchmark, incorporating evidence quality and recency into RAG retrieval improved answer quality relative to pure cosine similarity retrieval and full-corpus prompting. The evidence-temporal scoring function is lightweight to implement and adds minimal computational overhead to existing vector search pipelines, but broader validation across domains, evidence levels, and stronger retrieval baselines is required before claims of generalizable biomedical reliability can be made.
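The weighted scoring described in the Methods can be sketched in a few lines of Python. Only the three weights (50% cosine similarity, 30% evidence quality, 20% recency) come from the abstract; the `Chunk` class, the assumption that every component is already scaled to the range 0 to 1, and the example values are hypothetical, since the abstract does not say how GRADE levels or publication dates are mapped to numbers.

```python
from dataclasses import dataclass

# Weights stated in the abstract.
W_COSINE, W_EVIDENCE, W_RECENCY = 0.5, 0.3, 0.2


@dataclass
class Chunk:
    """A retrieved passage with three component scores, each ASSUMED to lie in [0, 1]."""
    label: str
    cosine: float    # semantic similarity to the query
    evidence: float  # hypothetical numeric mapping of study design / GRADE level
    recency: float   # hypothetical numeric mapping of publication date


def et_score(chunk: Chunk) -> float:
    """Weighted combination of similarity, evidence quality, and recency."""
    return (W_COSINE * chunk.cosine
            + W_EVIDENCE * chunk.evidence
            + W_RECENCY * chunk.recency)


def rerank(chunks: list[Chunk]) -> list[Chunk]:
    """Order retrieved chunks from highest to lowest combined score."""
    return sorted(chunks, key=et_score, reverse=True)


# Illustrative values only: the case report is the closer semantic match,
# but the meta-analysis carries a higher assumed evidence score.
retrieved = [
    Chunk("case report", cosine=0.9, evidence=0.2, recency=0.5),
    Chunk("meta-analysis", cosine=0.8, evidence=1.0, recency=0.5),
]
for chunk in rerank(retrieved):
    print(f"{chunk.label}: {et_score(chunk):.2f}")
```

With these made-up inputs the meta-analysis outranks the case report despite its lower cosine similarity, which is the behaviour the abstract contrasts with similarity-only ranking.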

15
Positioning Early Phase CNS Trials for Regulatory and Investor Success: Strategic Implications of the Single Phase 3 Approval Paradigm

Schmidt, P.; Preskorn, S.

2026-06-08 neurology 10.64898/2026.06.05.26353604 medRxiv
Top 0.2%
0.8%
Show abstract

In February 2026, the FDA announced that a single pivotal phase 3 (P3) trial would become the new default standard for drug approval - a regulatory direction that had been legally enabled since the FDA Modernization Act of 1997. This announcement has strategic, scientific, and economic implications for drug developers, contract research organizations (CROs), and biotech investors. We argue that the expansion of this framework, originally reserved for various niche submissions, represents a paradigm change that dramatically increases the value of rigorous early phase (P1 and P2) trial design and requires sponsors to establish both statistical efficacy signals and mechanistic biological understanding before entering phase 3. Using a CNS indication cost model, we show that single P3 approval can reduce total development expenditure from approximately $447 million over 14 years to $297 million over 12 years - a saving of $150 million that also provides two years of additional commercial runway for a modeled CNS drug. Case examples including lecanemab, omaveloxolone, and tofersen illustrate how biomarker-informed early phase strategies can establish the confirmatory evidence necessary for single-trial approval. We provide practical guidance for maximizing the value of P1 and P2 under this evolving framework.

16
CellExLink: End-to-end cell-type recognition and normalization in biomedical text

Nabijiang, A.; Shahriyari, L.

2026-05-29 bioinformatics 10.64898/2026.05.26.728013 medRxiv
Top 0.3%
0.4%
Show abstract

Since cells are the main components of many biological and biomedical studies, cell-type extraction is an important task in biomedical text mining. However, current biomedical text-mining systems either do not explicitly support cell-type extraction, provide limited support for Cell Ontology normalization, or show limited performance in end-to-end cell-type extraction. These limitations can affect downstream tasks that depend on reliable cell-type information. Here, we present CellExLink, an end-to-end biomedical natural language processing pipeline designed specifically for cell-type recognition and Cell Ontology normalization in biomedical text. The pipeline is designed to improve extraction accuracy and practical usability in literature-mining workflows, while accounting for computational efficiency in its recognition and normalization design. We evaluate CellExLink across heterogeneous biomedical corpora and compare it with established and recent biomedical text-mining tools. The results show that CellExLink provides reliable cell-type recognition, Cell Ontology normalization, and end-to-end extraction across these corpora. By addressing the need for reliable end-to-end cell-type recognition and Cell Ontology normalization, CellExLink can support downstream tasks such as curation, search, relation extraction, and knowledge graph construction. Author summary: Cell types are central to biomedical research, but biomedical papers often use different names, abbreviations, and synonyms for the same cell type. This variation makes it difficult for automated processes to collect and compare cell-type information across papers. Reliable automated extraction is important because literature mining requires consistent cell-type identification before evidence from different studies can be searched, integrated, or reused. 
Existing off-the-shelf biomedical text-mining tools provide useful functionality, but their ability to support cell-type extraction remains limited and inconsistent. To address this gap, we developed CellExLink, a pipeline that finds cell-type entities in biomedical text and links them to standard Cell Ontology identifiers. We evaluated the pipeline on several biomedical corpora and compared it with existing tools that support cell-type extraction. Across these evaluations, CellExLink showed clear accuracy gains in both detecting cell-type entities and assigning correct standard identifiers. Together, these gains make CellExLink a powerful tool for extracting reliable standardized cell-type information from large collections of papers, supporting literature curation, relation extraction, knowledge graph construction, and studies of cell-type-specific roles in diseases, drug responses, and biological pathways.

17
Generative Artificial Intelligence in Medical Education and Participatory Research for Social Action: A Human and AI Comparative Analysis

Juniu, S.; Castor, D.; Reyes Nieva, H.; Charon, R.; Amesty, S.

2026-05-21 medical education 10.64898/2026.05.14.26351842 medRxiv
Top 0.3%
0.4%
Show abstract

Participatory qualitative methods such as Photovoice are increasingly used to link research with social action. Recent advances in artificial intelligence (AI) may enhance data analysis, inference, and action planning within such participatory approaches. This study explored medical students' perceptions of social justice using conventional Photovoice analysis and assessed the potential contribution of generative AI (genAI). Nine students joined a six-week seminar, "Exploring the Concept of Social Justice Using Photovoice." An initial two-hour session covered ethics, the Photovoice framework, and photography techniques. Participants then captured images reflecting their views on social justice, wrote narratives, and engaged in guided group discussions. Human researchers and students conducted a three-stage Photovoice analysis: 1) selecting photographs, 2) contextualizing them with participant narratives, and 3) inductively coding themes. To explore how AI might support data analysis, the research team analyzed the same data with five generative tools including Sonix, ChatGPT, and Copilot. AI-generated themes and visual representations were compared with human-derived results for congruence, depth, and suggested action steps. Conventional analysis identified five major themes: (1) Social Justice and Inequality, (2) Contradictions and the Costs of Justice, (3) Community and Collective Action, (4) Environment and Environmental Justice, and (5) Perception, Subjectivity, and Perspective. AI-assisted analysis yielded six unified themes that closely aligned with human findings. Traditional Photovoice images conveyed authentic, lived experiences and strong emotional meaning, providing a powerful foundation for advocacy. AI-generated images and thematic summaries offered efficiency, creativity, and reduced researcher bias, improving generalizability. However, they lacked the emotional depth and contextual nuance present in participant-created visuals.

18
Machine-Assisted Topic Analysis of Large-Scale Health Experience Data: Identifying Sociodemographic Differences and Evaluating Bias in Large Language Models

Bondaronek, P.; Ward, E.; Beecham, E.; Zhang, E.; Huang, Y.; Ive, J.; Naughton, F.; Wu, H.; Vindrola-Padros, C.

2026-05-22 public and global health 10.64898/2026.05.20.26353755 medRxiv
Top 0.3%
0.3%
Show abstract

Introduction: Large-scale free-text data with socio-demographic information can capture nuanced accounts of lived experience that are difficult to detect in structured measures. However, manual qualitative analysis is difficult to scale, while automated approaches may obscure subgroup variation or introduce bias. This is especially relevant for large language models (LLMs), whose use in qualitative health research is increasing despite limited evaluation in socio-demographically stratified analysis. Objectives: This study examined how socio-demographic differences in health and wellbeing experiences were manifested in a large-scale free-text dataset, and evaluated how different AI-assisted analytic approaches identified these differences. Specifically, it aimed to: (1) identify socio-demographic differences using Machine-Assisted Topic Analysis (MATA); (2) compare MATA outputs with topic modelling combined with LLM-based topic interpretation; and (3) examine potential bias in LLM-based analysis. Methods: We analysed 2,177 valid free-text responses from the UK COVID-19 Wellbeing Tracker, a longitudinal survey of adults recruited during the pandemic. Responses described factors influencing health behaviours, mood, and wellbeing over time. Data were preprocessed and stratified by gender, age, and socioeconomic status (SES). MATA combined topic modelling, using Latent Dirichlet Allocation, with human-led qualitative interpretation of topic keywords and representative responses. The same topic model outputs were then interpreted using an LLM for comparison. Potential LLM bias was assessed using a demographic label-swap crossover design, with bias evaluated through Jaccard lexical similarity, VADER sentiment, and NRC emotion analysis. Grounded Review and Assessment of Computational Evidence (GRACE) was used to evaluate the AI outputs. 
Results: MATA identified meaningful socio-demographic thematic differences in pandemic-related mood and wellbeing across gender, age, and SES. Common themes included disruption, adaptation, uncertainty, routine, and the influence of work, relationships, and health on wellbeing. Male-stratified topics emphasised routines, habits, and coping with external pressures, whereas female-stratified topics were more relational and reflective, focusing on connection, isolation, family wellbeing, and anxiety. Lower SES narratives included practical strain, financial pressure, and loss of control, while higher SES narratives more often reflected adjustment, autonomy, and meaning-making. Older adults described health, gratitude, and family connection, whereas younger adults emphasised work-related stress and competing demands. LLM-based interpretation broadly reproduced the high-level subgroup patterns identified through MATA, but outputs were more generalised, less conceptually differentiated, and showed greater thematic overlap. Bias analysis showed systematic shifts in vocabulary, sentiment, and emotional tone when demographic labels were swapped, suggesting a risk of representational bias. Conclusions: MATA identified meaningful socio-demographic differences while retaining interpretative depth at scale. LLM-based topic interpretation showed utility for rapid thematic summarisation, but produced less conceptually differentiated outputs and was sensitive to demographic framing. The analysis also identified "LLM speak", where outputs appeared coherent but relied on abstract, generalised, and overlapping interpretations. Human oversight, structured qualitative appraisal, and explicit bias evaluation are necessary when using LLMs to analyse socially stratified free-text health data.
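One of the bias measures named in the Methods, Jaccard lexical similarity, can be illustrated with a short sketch. The tokenizer and the example sentences below are assumptions made for illustration; the abstract does not specify how the authors tokenized LLM outputs or which texts were compared.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase word set; this simple regex tokenizer is an assumption."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity |A & B| / |A | B| between the word sets of two texts."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty texts are treated as identical
    return len(ta & tb) / len(ta | tb)


# Hypothetical LLM interpretations of the same topic under swapped labels.
original = "older adults describe gratitude"
swapped = "younger adults describe stress"
similarity = jaccard(original, swapped)  # 2 shared words out of 6 distinct
```

In a label-swap design, a lower similarity between the two interpretations of identical topic content would indicate that the demographic label alone shifted the vocabulary.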

19
GeneKnow: AI-powered literature synthesis for gene-context analysis

Zhang, H.; Zang, C.

2026-06-01 bioinformatics 10.64898/2026.05.28.728511 medRxiv
Top 0.3%
0.3%
Show abstract

Interpreting gene function in specific biological contexts is essential for biomedical research, yet manual literature review is labor-intensive. We developed GeneKnow, a source-grounded framework that uses generative AI models within a controlled hybrid workflow to produce reliable, traceable literature synthesis supported by authentic citations. Through systematic benchmarking, we showed that GeneKnow outperforms mainstream web-interface AI tools in generating trustworthy context-specific gene function syntheses, with no fabricated citations and fewer hallucinations.

20
Widespread use of invalid statistical tests in biomedical machine learning

Zeng, T.; Li, H.; Zhang, S.; Tan, Y. Q.; Tian, F.; Orban, C.; An, L.; Che, W.; Cheng, J.; Chong, J. S. X.; Dehestani, N.; Dong, Z.; Li, X.; Li, Z.; Lim, M. J. R.; Lin, Y.; Ling, Q.; Ling, Z.; Low, X. Z.; Mansour L., S.; Ng, K. K.; Nguyen, T. T.; Ooi, L. Q. R.; Pande, S.; Qian, X.; Ruan, J.; Wang, Z.; Xie, Y.; Zhang, C.; Zhang, Y.; Patil, K.; Parkes, L.; Dhamala, E.; Chopra, S.; Zalesky, A.; Holmes, A.; Eickhoff, S.; Zhou, J. H.; Renaud, O.; Dosenbach, N.; Kording, K. P.; Bzdok, D.; Nichols, T.; Yeo, B. T. T.

2026-05-20 bioinformatics 10.64898/2026.05.17.724301 medRxiv
Top 0.3%
0.3%
Show abstract

Machine learning is accelerating biomedical research. Cross-validation is widely used to compare predictive performance - not only to benchmark algorithms, but also to inform scientific applications, such as ranking biomarkers. However, prediction performance estimates across cross-validation folds are not independent. Standard tests for comparing prediction performance (e.g., paired t-test) assume independence and can therefore inflate false positive rates. In a PRISMA-guided meta-analysis of 210 studies (impact factor ≥15, 1 June 2020 - 1 June 2025), we find that 97% ignored fold dependence when comparing prediction performance. This problem is ubiquitous across scientific fields and unaffected by impact factor, rigor-promoting policies, or open science practices. Simulations across 420 scenarios spanning four diverse datasets show that ignoring fold dependence leads to invalid false positive control in most settings. Repeated cross-validation further compounds this problem, with false positive rates rising toward 100% as the number of repetitions grows. Existing fold-dependence-aware tests rely on strong assumptions because the variance of fold-level statistics and the between-fold correlation cannot be disentangled under standard cross-validation. We therefore propose the SHARP (Split-HAlf RePeated) test, a simple modification to standard cross-validation that enables direct estimation of variance and correlation. Benchmarked against 12 tests, SHARP provides the best overall balance of false-positive control, statistical power, and confidence-interval calibration across simulation schemes. We conclude by providing best practices and reporting guidelines for valid model comparison inference in biomedical machine learning and beyond.
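To illustrate why fold dependence matters, the sketch below contrasts the naive paired t statistic with one well-known dependence-aware alternative, the corrected resampled t-test of Nadeau and Bengio. This is not the SHARP test proposed in the abstract, whose details are not given here; the per-fold differences and the train/test sizes are made-up values for the example.

```python
import math
from statistics import mean, variance


def naive_paired_t(diffs: list[float]) -> float:
    """Paired t statistic that treats per-fold differences as independent."""
    k = len(diffs)
    return mean(diffs) / math.sqrt(variance(diffs) / k)


def corrected_resampled_t(diffs: list[float], n_train: int, n_test: int) -> float:
    """Nadeau-Bengio corrected statistic: the variance factor 1/k is replaced
    by (1/k + n_test/n_train) to account for overlapping training sets."""
    k = len(diffs)
    scale = 1.0 / k + n_test / n_train
    return mean(diffs) / math.sqrt(scale * variance(diffs))


# Hypothetical per-fold accuracy differences between two models (5-fold CV).
diffs = [0.02, 0.01, 0.03, 0.02, 0.02]
t_naive = naive_paired_t(diffs)
t_corrected = corrected_resampled_t(diffs, n_train=80, n_test=20)
```

Both statistics are compared against a t distribution with k - 1 degrees of freedom. With these sizes the corrected statistic is two thirds of the naive one, so a result that looks significant under the independence assumption can fall short once the dependence correction is applied.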